Abstract

In this paper, we propose several novel deep learning methods for object
saliency detection based on powerful convolutional neural networks. In our
approach, we use a gradient descent method to iteratively modify an input
image based on the pixel-wise gradients to reduce a cost function measuring
the class-specific objectness of the image. The pixel-wise gradients can be
efficiently computed using the back-propagation algorithm. The discrepancy
between the modified image and the original one may be used as a saliency
map for the image. Moreover, we further propose several new training methods
to learn saliency-specific convolutional nets for object saliency detection,
in order to leverage the available pixel-wise segmentation information. Our
methods are computationally very efficient (processing 20-40 images per
second on one GPU). In this work, we use the computed saliency maps for
image segmentation. Experimental results on two benchmark tasks, namely
Microsoft COCO and Pascal VOC 2012, show that our proposed methods can
generate high-quality saliency maps, clearly outperforming many existing
methods. In particular, our approaches excel in handling many difficult
images, which contain complex backgrounds, highly variable salient objects,
multiple objects, and/or very small salient objects.

1. Introduction 

In the past few years, deep convolutional neural networks (DCNNs) [13] have
achieved state-of-the-art performance in many computer vision tasks,
starting from image recognition [12, 23, 22] and object localization [2' ]
and more recently extending to object detection and semantic image
segmentation [9, 11]. These successes are largely attributed to the capacity
of large-scale DCNNs to learn effectively end-to-end from a large number of
labelled images in a supervised learning mode.

In this paper, we consider applying the popular deep learning techniques to
another computer vision problem, namely object saliency detection. Saliency
detection attempts to locate the objects of greatest interest in an image,
which are also the regions to which humans tend to pay more attention [17].
The main goal of saliency detection is to compute a saliency map that
topographically represents the level of saliency for visual attention [2 ].
For each pixel in an image, the saliency map indicates how likely it is that
this pixel belongs to the salient objects [4]. Computing such saliency maps
has recently attracted a great amount of research interest [ ]. The computed
saliency maps have been shown to be beneficial to various vision tasks, such
as image segmentation [6], object recognition and visual tracking. Saliency
detection has been extensively studied in computer vision, and a variety of
methods have been proposed to generate saliency maps for images. Under the
assumption that the salient objects are probably the parts that differ
significantly from their surroundings, most of the existing methods use
low-level image features to detect saliency based on criteria related to
contrast, rarity and symmetry of image patches [6, 17, 18, 4]. In some
cases, global topological cues may be leveraged to refine the perceptual
saliency maps [10, 25, 1 ]. In these methods, the saliency is normally
measured based on different mathematical models, including decision
theoretic models, Bayesian models, information theoretic models, graphical
models and spectral analysis models [3].

In this paper, we propose a novel deep learning method for object saliency
detection based on the powerful DCNNs. As shown in [12, 23, 22], relying on
a well-trained DCNN, we can achieve a fairly high accuracy in object
category recognition for many real-world images. Even though DCNNs can
recognize what objects are contained in an image, it is not straightforward
for DCNNs to precisely locate the recognized objects in the image. In
[20, 9, 11], some rather complicated and time-consuming post-processing
stages are needed to detect and locate the objects for semantic image
segmentation. In this work, we propose a much simpler and more
computationally efficient method to generate a class-specific object
saliency map directly from the classification DCNN model. In our approach,
we use a gradient descent (GD) method to iteratively modify each input image
based on the pixel-wise gradients to reduce a cost function measuring the
objectness of the image. The gradients with respect to all image pixels can
be efficiently computed using the back-propagation algorithm for DCNNs. At
the end, the discrepancy between the modified image and the original one is
calculated as the saliency map for this image. Moreover, as more and more
images with pixel-wise segmentation labels become available, e.g. [8, 16],
we further propose two more methods that leverage the available pixel-wise
segmentation information to learn saliency-specific DCNNs for object
saliency detection. In these methods, the original images as well as the
corresponding masked images, in which all objects are masked out according
to the pixel-wise labels, are used to train two DCNNs whose output labels
are modified to include the masked objects and/or the original objects.
Afterwards, we similarly use the GD method to modify each input image to
reduce two cost functions formulated to measure the objectness for each
case. The saliency map is generated in the same way, as the discrepancy
between the original and modified images. Since we only need to run a very
small number of GD iterations in the saliency detection, our methods are
computationally very efficient (processing 20-40 images per second on one
GPU). The computed saliency maps may be used for many computer vision tasks.
In this work, as one particular application, we use the computed saliency
maps to drive a popular image segmenter in [1] to perform image
segmentation. Experimental results on two databases, namely Microsoft COCO
[16] and Pascal VOC 2012 [8], show that our proposed methods can generate
high-quality saliency maps, clearly outperforming many existing methods. In
particular, our DCNN-based approaches excel on many difficult images
containing complex backgrounds, highly variable salient objects, multiple
objects, and/or very small objects.

2. Related Work 

In the literature, previous saliency detection methods mostly adopt the
well-known bottom-up strategy [6, 17, 18, 4]. They rely on local image
features derived from patches to detect contrast, rarity and symmetry in
order to identify the salient objects in an image. Meanwhile, some other
methods have been proposed that take into account some global information or
prior knowledge to screen the local features. For example, in [25], a
boolean map is created to represent global topological cues in an image,
which in turn is used to guide the generation of saliency maps. In [15], the
visual saliency algorithm considers the prior information and the local
features simultaneously in a probabilistic model. The algorithm defines
task-related components as the prior information to help the feature
selection procedure. The traditional saliency detection methods normally
work well for images containing simple dominant foreground objects in
homogeneous backgrounds. However, they are usually not robust enough to
handle images containing complex scenes [14].

As an important application, the saliency maps may be used as good guidance
for various image segmentation algorithms. In [7], a recursive segmentation
process is used, where each iteration focuses on different saliency regions.
As a result, the algorithm can output several potential segmentation
candidates from the saliency maps. These candidates may be further merged by
maximizing the likelihood at all image pixels, considering low-level
features such as colour and texture. In [6], a region contrast based image
saliency method is proposed to generate the saliency maps, and the
SaliencyCut algorithm is used to derive the image segmentation from the
saliency maps. The SaliencyCut algorithm is based on the standard GrabCut
[19] but it uses the proposed saliency maps instead of manually selected
bounding boxes for initialization.

Recently, some deep learning techniques have been proposed for object
detection and semantic image segmentation [20, 9, 11]. These methods
typically use DCNNs to examine a large number of region proposals from other
algorithms, and use the features generated by DCNNs along with other
post-stage classifiers to localize the target objects. They initially relied
on bounding boxes for object detection. More recently, more and more methods
have been proposed to directly generate pixel-wise image segmentation, e.g.
[11]. In this paper, instead of directly generating the high-level semantic
segmentation from DCNNs, we propose to use DCNNs to generate middle-level
saliency maps in a very efficient way, which may be fed to other traditional
computer vision algorithms for various vision tasks, such as semantic
segmentation, video tracking, etc.

The work in [21] is the most relevant to the work in this paper. In [21],
the authors have borrowed the idea of explanation vectors in [2] to generate
a static pixel-wise gradient vector of the network learning objective
function, and use it as a saliency map. In our work, we instead use an
iterative gradient descent method to generate more reliable and robust
saliency maps. More importantly, we have proposed two new methods to learn
saliency-specific DCNNs and define the corresponding cost functions, which
measure objectness in each model for saliency detection.

3. Our Approach for Object Saliency Detection 

It is well known that DCNNs can automatically learn all sorts of features
from a large number of labelled images, and that a well-trained DCNN can
achieve very good classification accuracy in recognizing objects in images.
In this work, based on the idea of explanation vectors in [2], we argue that
the classification DCNNs themselves may have learned enough features and
information to generate good object saliency maps for the images. Extending
a preliminary study in [2 ], we explore several novel methods to generate
the saliency maps directly from DCNNs. The key idea of our approaches is
shown in Figure 1. After an input image is recognized by a DCNN as
containing one particular object, if we can modify the input image in such a
way that the DCNN no longer recognizes the object from it, the discrepancy
between the modified image and the original one may serve as a good saliency
map for the recognized object. In this paper, we propose to use a gradient
descent (GD) method to iteratively modify the input image based on the
pixel-wise gradients to reduce a cost function formulated in the output
layer of the DCNN to measure the class-specific objectness. The gradients
are computed by applying the back-propagation procedure all the way to the
input layer.

Figure 1. The proposed method to generate the object-specific saliency maps directly from DCNNs.

In section 3.1, we first introduce several different ways 
to learn DCNNs for saliency detection. In section 3.2, we 
present our algorithm used to generate the saliency maps 
from DCNNs in detail. 

3.1. Learning DCNNs for Object Saliency 

Compared with the traditional bottom-up methods, DCNNs may potentially learn
more prior information for saliency detection. Two types of prior
information are considered here. The first type is the class prior, which is
provided by the class labels of all training images. The second one is the
pixel-wise object prior, which may be available as the object masking
information in some data sets.

First of all, the regular classification DCNN may be used for saliency
detection, which is named CNN1 hereafter. As shown in Figure 2, CNN1 takes
an image as input and it contains a node in the output layer for each object
category. CNN1 is trained using all labelled images in the training set.

If the pixel-wise object masking information is available, we may mask out
the corresponding objects in the original images to generate the so-called
masked images. In this way, we may train different DCNNs that learn from the
pixel-wise masking information, which will lead to much better DCNNs for the
saliency detection purpose. For example, we may learn another DCNN with the
masked images only, named CNN2. As in Figure 2, CNN2 is trained by using all
masked images in the training set as input and it has a node in the output
layer corresponding to each masked object class.

Moreover, as shown in Figure 2, we train a slightly modified DCNN, named
CNN3, with both the original labelled images and all masked images, in which
all labelled objects are masked out based on the pixel-wise masking. For
CNN3, we expand its output layer to include two nodes for each object
category: one for the normal objects and the other for the masked objects.
For example, when we use an original image containing a giraffe to learn
CNN3, we use the label information corresponding to the regular giraffe node
in the output layer, denoted as Giraffe. Meanwhile, when we use the same
image with the animal region masked out, we use the label information
corresponding to the masked giraffe node in the output layer, denoted as
$\overline{\text{Giraffe}}$. Compared with CNN2, CNN3 is trained in a way to
learn the contrast information between original labelled images and their
masked versions.

Figure 2. The proposed training procedure to learn DCNNs for object saliency detection.
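
As a rough illustration of the training data described in this subsection, the following Python sketch builds a masked image from a pixel-wise object mask and maps a class to the expanded CNN3 output layer. The fill value used for masked pixels (zero) and the layout of the 2N output nodes (normal classes first, masked classes offset by N) are assumptions made for this example only; the text does not specify either, and the helper names are hypothetical.

```python
def mask_out(image, mask, fill=0):
    """Return a copy of `image` with every pixel where mask == 1 replaced.

    image: H x W x 3 nested lists (RGB); mask: H x W nested lists of 0/1.
    The fill value is an assumption, not something stated in the paper.
    """
    return [
        [[fill, fill, fill] if m else list(px) for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def cnn3_label(class_index, num_classes, masked):
    """Map a class to one of the 2N output nodes (assumed layout).

    Nodes 0..N-1 stand for normal objects, nodes N..2N-1 for masked objects.
    """
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    return class_index + num_classes if masked else class_index


# Tiny 2 x 2 RGB image with two object pixels marked in the mask.
image = [[[10, 20, 30], [40, 50, 60]],
         [[70, 80, 90], [15, 25, 35]]]
mask = [[0, 1],
        [1, 0]]
masked_image = mask_out(image, mask)
```

Under this layout, an original giraffe image and its masked version would receive two different target nodes, which is what lets CNN3 learn the contrast between them.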


3.2. Generating Saliency Maps from DCNNs 

After the three DCNNs (CNN1, CNN2 and CNN3) are learned, we may apply our
saliency detection methods to generate the class-specific object saliency
map, as shown in Figure 1.

For each input image, we first use CNN1 to generate its class label, denoted
as l, as in a normal classification step. Next, we may use one of the DCNNs
to generate the saliency map. In this step, the selected DCNN is kept
unchanged and instead we attempt to modify the input image at the pixel
level to reduce a cost function, which is defined to measure the
class-specific objectness in each case. In the following, we introduce how
to define the cost function for each DCNN and the details of generating the
saliency maps.

For CNN1, we denote its output nodes after softmax as
$\{y_i^{(1)} \mid i = 1, \cdots, N\}$, each of which corresponds to one
class label ($N$ classes in total). Assuming an input image $X$ is
recognized as class $l$, we may define the following cost function to
measure the class-specific objectness in this case:

$$\mathcal{F}^{(1)}(X|l) = \ln y_l^{(1)}. \qquad (1)$$

The key idea here is that we try to modify the image $X$ to reduce the above
cost function, so that the underlying object (belonging to class $l$) is
hopefully removed as a consequence. In this paper, we propose to use an
iterative GD procedure to modify $X$ as follows:

$$X^{(t+1)} = X^{(t)} - \epsilon \cdot \max\left( \frac{\partial \mathcal{F}^{(1)}(X|l)}{\partial X}, \, 0 \right), \qquad (2)$$

where $\epsilon$ is a learning rate, and we floor all negative gradients in
the GD updates. We have observed in our experiments that the cost function
$\mathcal{F}^{(1)}(X|l)$ can be significantly reduced by running only a
small number of updates (typically 10-15 iterations) for each image.

We can easily compute the above gradients using the standard
back-propagation algorithm. Based on the cost function $\mathcal{F}^{(1)}$
in eq.(1), we can derive the error signals in the output layer as
$e_i^{(1)} = \delta(i - l) - y_i^{(1)}$ ($i = 1, \cdots, N$), where
$\delta(\cdot)$ stands for the Kronecker delta function. These error signals
are back-propagated all the way to the input layer to derive the above
gradient, $\partial \mathcal{F}^{(1)}(X|l) / \partial X$, for saliency
detection.
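
The output-layer error signal quoted above is the derivative of the cost ln y_l with respect to the pre-softmax activations. The short Python sketch below checks this numerically on a toy 3-class example; the logits and helper names are hypothetical and only serve to compare the analytic expression against central finite differences.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def cost(z, l):
    """Cost ln(y_l) as a function of the pre-softmax activations z."""
    return math.log(softmax(z)[l])


def error_signals(z, l):
    """Analytic error signals e_i = delta(i - l) - y_i."""
    y = softmax(z)
    return [(1.0 if i == l else 0.0) - y_i for i, y_i in enumerate(y)]


def numeric_gradient(z, l, h=1e-6):
    """Central finite differences of the cost with respect to each z_i."""
    grads = []
    for i in range(len(z)):
        up = list(z)
        down = list(z)
        up[i] += h
        down[i] -= h
        grads.append((cost(up, l) - cost(down, l)) / (2 * h))
    return grads


z = [0.5, -1.2, 2.0]  # arbitrary logits for a 3-class toy example
analytic = error_signals(z, l=2)
numeric = numeric_gradient(z, l=2)
```

The two lists agree to within finite-difference error. Flipping the sign of every entry gives the error signals for the negative log costs used with CNN2 and CNN3.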

For CNN2, we denote its output nodes after softmax as
$\{y_i^{(2)} \mid i = 1, \cdots, N\}$, each of which corresponds to one
class of masked objects. Given an input image $X$ and its recognized class
$l$ (from CNN1), we define the following cost function for this case:

$$\mathcal{F}^{(2)}(X|l) = -\ln y_l^{(2)}. \qquad (3)$$

Similarly, we apply the above GD algorithm in eq.(2) to modify the image to
reduce this cost function. By reducing $\mathcal{F}^{(2)}$, we try to
increase the probability of the corresponding masked class. Intuitively, we
attempt to alter the input image to match the masked images in that class as
much as possible. In the same way, the error signals in the output layer can
be simply derived as $e_i^{(2)} = y_i^{(2)} - \delta(i - l)$
($i = 1, \cdots, N$), which are back-propagated all the way to the input
layer to compute $\partial \mathcal{F}^{(2)}(X|l) / \partial X$.

Algorithm 1 GD based Object Saliency Detection
Input: an input image $X$, CNN1, CNN2 and CNN3;
Use CNN1 to recognize the object label for $X$ as $l$;
Choose a saliency model (CNN1 or CNN2 or CNN3);
$X^{(0)} = X$;
for each epoch $t = 1$ to $T$ do
    forward pass: compute the cost function $\mathcal{F}(X|l)$;
    backward pass: back-propagate to input layer to compute gradient: $\partial \mathcal{F} / \partial X^{(t-1)}$;
    $X^{(t)} \leftarrow X^{(t-1)} - \epsilon \cdot \max\left( \partial \mathcal{F} / \partial X^{(t-1)}, \, 0 \right)$;
end for
Average over RGB: $S = \frac{1}{3} \sum_{i=1}^{3} \left( X_i^{(0)} - X_i^{(T)} \right)$;
Prune noises with a threshold $\theta$: $S = \max(S - \theta, 0)$;
Normalize: $S = S / \|S\|$;
Output: the raw saliency map $S$;
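
To make the loop in Algorithm 1 concrete, here is a minimal Python sketch that runs it on a toy model. A small linear softmax classifier over a flattened single-channel image stands in for the DCNN, so the RGB averaging step is skipped; the weights, learning rate and function names are all hypothetical, and the cost used is the CNN1 cost ln y_l.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def cost_and_grad(x, W, b, l):
    """Cost F = ln y_l of a toy linear softmax model and its gradient dF/dx."""
    z = [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]
    y = softmax(z)
    # Output-layer error signals: delta(i - l) - y_i.
    err = [(1.0 if i == l else 0.0) - y_i for i, y_i in enumerate(y)]
    # Back-propagate through the linear layer to the input pixels.
    grad = [sum(err[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]
    return math.log(y[l]), grad


def gd_saliency(x0, W, b, l, eps=0.1, T=10, theta=0.0):
    """Run T floored-gradient updates and return (saliency map, final image)."""
    x = list(x0)
    for _ in range(T):
        _, g = cost_and_grad(x, W, b, l)
        x = [v - eps * max(gv, 0.0) for v, gv in zip(x, g)]  # floor negatives
    s = [a - c for a, c in zip(x0, x)]          # X(0) - X(T), non-negative
    s = [max(v - theta, 0.0) for v in s]        # prune small responses
    norm = math.sqrt(sum(v * v for v in s))
    if norm > 0:
        s = [v / norm for v in s]               # unit norm
    return s, x


# Two classes over four pixels: class 0 looks only at pixel 0, class 1 at pixel 3.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]
x0 = [1.0, 1.0, 1.0, 1.0]
saliency, x_final = gd_saliency(x0, W, b, l=0)
```

In this toy case only pixel 0 supports class 0, so it is the only pixel that gets modified and the only non-zero entry of the saliency map. Pixel 3 has a negative gradient, which the flooring discards.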


Finally, for CNN3, we denote its output nodes after softmax as
$\{y_i^{(3)} \mid i = 1, \cdots, 2N\}$, each of which corresponds to either
an image class or a masked class. Given an input image $X$ and its
recognized class $l$, we find the output node corresponding to the masked
class of $l$, denoted as $\bar{l}$. We define the cost function for CNN3 as
follows:

$$\mathcal{F}^{(3)}(X|l) = -\ln y_{\bar{l}}^{(3)}. \qquad (4)$$

Similarly, the image is modified by running the GD algorithm in eq.(2) to
reduce $\mathcal{F}^{(3)}$, or equivalently to increase
$y_{\bar{l}}^{(3)}$. Since all output nodes are normalized by softmax, by
increasing $y_{\bar{l}}^{(3)}$, its original output node $y_l^{(3)}$ will be
reduced accordingly. Intuitively speaking, by doing so, we attempt to use
the contrast information learned by CNN3 to modify an image from its
original class to match the corresponding masked version for the object
saliency detection. Similarly, the error signals in the output layer are
derived as $e_i^{(3)} = y_i^{(3)} - \delta(i - \bar{l})$, where
$i = 1, \cdots, 2N$.

At the end of the gradient descent updates, the object saliency map is
computed as the difference between the modified image and the original one,
i.e. $X^{(0)} - X^{(T)}$. For colour images, we average the differences over
the RGB channels to obtain a pixel-wise raw saliency map, which is then
normalized to be of unit norm. After that, we may apply a simple threshold
to filter out some background noise in the raw saliency maps. The entire
algorithm to generate the raw saliency maps is shown in Algorithm 1.

For each image, we can obtain 3 different saliency maps 
with the three different DCNNs. We have found that we 
may obtain even better results if we combine the saliency 
maps from CNN2 and CNN3 by taking an average between 
them. We can also use a simple image dilation and erosion 
method to smooth the raw saliency maps to derive the final 
saliency maps. 
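
The combination and smoothing steps can be sketched in a few lines of Python. The text does not give the structuring element or the order of the two morphological operations, so the 3 x 3 window and the dilation-then-erosion order (a morphological closing) used below are assumptions, as are the function names.

```python
def average_maps(a, b):
    """Pixel-wise average of two saliency maps of identical size."""
    return [[(u + v) / 2.0 for u, v in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def _filter3x3(s, reduce_fn):
    """Apply max or min over each pixel's in-bounds 3 x 3 neighbourhood."""
    h, w = len(s), len(s[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [s[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, h))
                      for cc in range(max(c - 1, 0), min(c + 2, w))]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def dilate(s):
    return _filter3x3(s, max)


def erode(s):
    return _filter3x3(s, min)


def smooth(s):
    """Dilation followed by erosion, which fills small holes in the map."""
    return erode(dilate(s))


# A 7 x 7 map with a 3 x 3 salient block whose centre pixel is missing.
ring = [[0.0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        ring[r][c] = 1.0
ring[3][3] = 0.0
smoothed = smooth(ring)
```

On this example the hole at the centre is filled while the outline of the block is preserved.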


4. Saliency Refinement and Image Segmentation


Here, as one application, we use the derived saliency 
maps to perform semantic image segmentation. 

Inspired by the recent work in [11], we aim to refine our saliency map using
segmentation and also to obtain a binary salient object segmentation. We
make use of a recent state-of-the-art image segmentation tool called
Multiscale Combinatorial Grouping (MCG) [1], which provides us with a
well-defined contour map and also a set of object proposals. The idea of
refining the saliency map is simple: we randomly select 50 points from the
salient point sets and use these selected points as seed information to
perform an interactive image segmentation. We restrict it to be a binary
segmentation to extract the salient foreground. We independently run this
experiment 100 times and average the binary segmentation results to obtain a
refined saliency map.

To obtain the final binary salient object segmentation, we use the top 50
object proposals generated by MCG. Among the super-pixel segmentations
associated with these proposals, we choose the one with the highest Jaccard
index with respect to a thresholded binary mask of the saliency map.
Specifically, given the final saliency map $S$, we obtain a binary mask
$M_1 = I\{S > \delta\}$, where $\delta$ is a threshold (we set it to 0.5 in
this work). For each super-pixel segmentation from each proposal, denoted as
$M_2$, we calculate the Jaccard index as follows:

$$\mathrm{Jaccard}(M_1, M_2) = \frac{\|M_1 \cap M_2\|}{\|M_1 \cup M_2\|}$$

The super-pixel segmentation that has the largest Jaccard index with the
thresholded saliency map is chosen as the final salient object segmentation.
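
The proposal selection step can be illustrated with the Python sketch below. Masks are represented as sets of (row, column) pixel coordinates, which is a choice made for this example; the proposals here are hand-written stand-ins, not MCG output, and the function names are hypothetical.

```python
def binarize(saliency, threshold=0.5):
    """Return the set of (row, col) pixels whose saliency exceeds the threshold."""
    return {(r, c)
            for r, row in enumerate(saliency)
            for c, v in enumerate(row)
            if v > threshold}


def jaccard(m1, m2):
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = m1 | m2
    if not union:
        return 0.0
    return len(m1 & m2) / len(union)


def best_proposal(saliency, proposals, threshold=0.5):
    """Index of the proposal with the largest Jaccard index against the mask."""
    m1 = binarize(saliency, threshold)
    return max(range(len(proposals)), key=lambda k: jaccard(m1, proposals[k]))


saliency = [[0.9, 0.8, 0.1],
            [0.7, 0.2, 0.0]]
proposals = [
    {(0, 0)},                          # too small
    {(0, 0), (0, 1), (1, 0), (1, 1)},  # covers the salient region plus one pixel
    {(1, 2)},                          # disjoint from the salient region
]
chosen = best_proposal(saliency, proposals)
```

Here the second proposal wins with a Jaccard index of 3/4, against 1/3 and 0 for the other two.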


5. Experiments 

We select two benchmark databases to evaluate the performance of the
proposed object saliency detection and image segmentation methods, namely
Microsoft COCO [16] and Pascal VOC 2012 [8]. Both databases provide the
class label of each image as well as the pixel-wise segmentation map (ground
truth), so we can generate the masked images to train the required DCNNs in
our proposed methods. Here we compare our approaches with two existing
methods: i) the first one is the Region Contrast saliency method and the
SaliencyCut segmentation method in [6]. This method is one of the most
popular bottom-up image saliency detection methods in the literature and it
has achieved state-of-the-art image saliency and segmentation performance on
many tasks; ii) the second one is the DCNN based image saliency detection
method proposed in [21]. Similar to our approaches, this method also uses
DCNNs and the back-propagation algorithm to generate saliency maps. In our
experiments, we use the precision-recall curves (PR-curves) against the
ground truth as one metric to evaluate the performance of saliency
detection. As in [6], for each saliency map, we vary the cutoff threshold
from 0 to 255 to generate 256 precision and recall pairs, which are used to
plot a PR-curve. In addition, we use $F_\beta$ to measure the performance of
both saliency detection and segmentation, which is calculated from the
precision (Prec) and recall (Rec) values with a non-negative weight
parameter $\beta$ as follows [4]:

$$F_\beta = \frac{(1 + \beta^2)\, \mathrm{Prec} \times \mathrm{Rec}}{\beta^2\, \mathrm{Prec} + \mathrm{Rec}}$$

In this paper, we follow [6] to set $\beta^2 = 0.3$ to emphasize the
importance of Prec. Note that we only get a single $F_\beta$ value for each
binary segmentation map for segmentation. However, we may derive a sequence
of $F_\beta$ values along the PR-curve for each saliency map, and the
largest one is selected as the performance measure (see [4]).
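
The evaluation protocol above can be sketched as follows in Python. The sketch assumes an 8-bit saliency map, binarizes it with `>=` at each cutoff, and treats precision as 1 when no pixel is predicted salient; these conventions and the function names are assumptions for illustration, not details given in the text.

```python
def precision_recall(saliency, truth, threshold):
    """Binarize an 8-bit saliency map at `threshold` and score it against truth."""
    tp = fp = fn = 0
    for s_row, t_row in zip(saliency, truth):
        for s, t in zip(s_row, t_row):
            predicted = s >= threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
    prec = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is predicted
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec


def f_beta(prec, rec, beta2=0.3):
    """Weighted F-measure with beta^2 = 0.3 by default."""
    denom = beta2 * prec + rec
    return (1 + beta2) * prec * rec / denom if denom else 0.0


def max_f_beta(saliency, truth, beta2=0.3):
    """Largest F_beta over the 256 cutoff thresholds 0..255."""
    return max(f_beta(*precision_recall(saliency, truth, t), beta2)
               for t in range(256))


saliency = [[200, 50],
            [180, 10]]
truth = [[1, 0],
         [1, 0]]
best = max_f_beta(saliency, truth)
```

For this toy map any cutoff between 51 and 180 separates the two salient pixels perfectly, so the maximum is 1. At cutoff 0 every pixel is predicted salient, giving Prec = 0.5 and Rec = 1.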

5.1. Databases 

Microsoft COCO [16] is a new image database that may be used for several
vision tasks including image classification and segmentation. The database
currently contains 82,783 training images and 40,504 validation images with
80 labelled categories. In our experiments, we only select the images that
contain one category of objects because these images are more compatible
with the available DCNN baseline, which is normally trained using the
ImageNet data. The selected COCO subset contains 6869 training images and
3479 validation images with 18 different classes.

The Pascal VOC 2012 database [8] can also be used for our proposed
algorithms, but its sample size is much smaller compared with COCO. We use
the whole dataset, which has 1464 training images and 1449 validation images
with 20 label categories in total. For images that are labelled as having
more than one class of objects, we use the area of the labelled objects to
measure their importance and use the class of the most important object to
label the images for our DCNN training process.
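
The area-based labelling rule can be written compactly in Python. The sketch assumes the segmentation ground truth is available as a 2-D map of integer class ids, with 0 for background and 255 for the void boundary label as in the Pascal VOC segmentation masks; the function name and the example ids are hypothetical.

```python
from collections import Counter


def dominant_class(label_map, ignore=(0, 255)):
    """Return the class id covering the largest area, or None if no object exists.

    label_map: H x W nested lists of integer class ids.
    ignore: ids that do not count as objects (background and void here).
    """
    counts = Counter(v for row in label_map for v in row if v not in ignore)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Class 12 covers three pixels and class 15 covers two, so the image is labelled 12.
label_map = [[0, 12, 12],
             [15, 12, 255],
             [15, 0, 0]]
image_label = dominant_class(label_map)
```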

As we have mentioned earlier, we need to train the three DCNNs, i.e., CNN1,
CNN2 and CNN3, for each dataset. However, because the training sets are
relatively small in both COCO and Pascal, we start from a DCNN that is well
trained on the ImageNet database, which contains 5 convolutional layers and
2 fully connected layers (see footnote 1). We only use the above-mentioned
training data to fine-tune this DCNN for each task with MatConvNet [24]. For
the Pascal VOC 2012 data, we further use 5-fold cross-validation to expand
the training sample size. We use the training set and about 80% of the
validation data to fine-tune the model, which is then tested on the
remaining 20% of the data. We rotate five times to cover the entire test
set. In Table 1, we list the top-1 and top-5 classification error rates when
the fine-tuned DCNNs are used to recognize the test sets on these two tasks.

1 We use the net imagenet-vgg-s in http://www.vlfeat.org/matconvnet/ [5].

Dataset          Error   CNN1    CNN2    CNN3
MS COCO          Top-1   12.2%   19.1%   16.7%
MS COCO          Top-5    2.4%    3.2%    4.0%
Pascal VOC 2012  Top-1   20.3%   35.1%   26.5%
Pascal VOC 2012  Top-5    3.1%    8.4%    9.7%

Table 1. The classification error rates of three CNNs on the MS COCO and
Pascal VOC 2012 test sets.
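
The five-fold rotation described for Pascal VOC 2012 can be sketched as below. How the validation images are assigned to folds is not stated in the text, so the simple striding used here is an assumption, and the integer ids stand in for real image identifiers.

```python
def five_fold_splits(train_ids, val_ids, folds=5):
    """Yield (fine_tune_ids, test_ids) pairs, one per fold.

    Each fold holds out about 1/folds of the validation images for testing
    and adds the rest to the training images for fine-tuning.
    """
    for k in range(folds):
        test = val_ids[k::folds]  # every folds-th item, starting at offset k
        held_out = set(test)
        fine_tune = list(train_ids) + [v for v in val_ids if v not in held_out]
        yield fine_tune, test


train_ids = list(range(100, 110))  # stand-ins for training image ids
val_ids = list(range(10))          # stand-ins for validation image ids
splits = list(five_fold_splits(train_ids, val_ids))
```

Across the five folds every validation image is tested exactly once, and it never appears in the fine-tuning set of the fold that tests it.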

The classification errors on the test sets imply that the 
training sample size is still not enough for training deep 
convolutional networks well, especially for Pascal VOC 
2012. However, as we will see, the proposed algorithms 
can still yield good performance for saliency detection and 
segmentation. If we have more training data that include 
class labels and the masked images, we may expect even 
better saliency and segmentation results. 

5.2. Saliency and Segmentation Results 

In this part we provide saliency detection and segmentation results on these
two databases. In the following, the PR-curves, $F_\beta$ values and some
sample images will be used to compare different methods.

5.2.1 Microsoft COCO 

For the object saliency detection, we first plot the PR-curves for different
methods, which are all shown in Fig. 3. From the PR-curves, we can see that
our proposed saliency detection methods significantly outperform the region
contrast method in [6] and the DCNN based saliency method in [21]. Moreover,
CNN2 and CNN3 yield better performance than CNN1, which demonstrates that
the utilization of masked images in model training can further improve the
saliency detection performance.

Figure 4 shows the $F_\beta$ values of the different saliency and
segmentation methods, from which we can see that the three proposed saliency
detection methods give better $F_\beta$ values than [6] and [21]. Starting
from our saliency maps, the MCG-based segmentation algorithm can yield good
performance as well. Moreover, the segmentation results also show the
benefits of using the masked images as prior information in the DCNN
training. Finally, in Figure 7 (Column 1 to 5), we also provide some
examples of the saliency detection and segmentation results from the COCO
test set. From these examples we can see that the region contrast algorithm
does not work well when the input images have complex backgrounds or contain
highly variable salient objects, and this problem is fairly common among
most bottom-up saliency and segmentation algorithms. On the other hand, we
can also see that, with the help of masked images in training, our proposed
DCNN-based saliency detection methods concentrate much better on the salient
objects. Note that the segmentation results based on [21] are not shown in
Figure 7 since they are significantly worse than the others.

Figure 3. The PR-curves of different saliency methods on the MS COCO test set.



Figure 4. The $F_\beta$ values of different saliency and segmentation
methods on MS COCO test set.


5.2.2 Pascal VOC 2012

Similarly, we also use PR-curves and $F_\beta$ to evaluate the saliency and
segmentation performance on the Pascal VOC 2012 database. From Fig. 5, we
can see that the proposed methods are significantly better than [21], and
the DCNNs that make use of masked images yield performance comparable to
[6]. As shown in Fig. 6, our methods still give slightly better $F_\beta$
values for both saliency detection and segmentation than [6], but the
difference between them is not significant. This may be partially attributed
to the poor DCNN models in the Pascal dataset, which are fine-tuned with
only a very small number of in-domain images. In Fig. 7, we also select
several Pascal images to show the saliency and segmentation results (Column
6 to 10). Some of these examples suggest that our methods are able to handle
images that contain multiple objects.

Figure 5. The PR-curves of different saliency methods on Pascal VOC 2012
test set.

Figure 6. The $F_\beta$ values of different saliency and segmentation
methods on Pascal VOC 2012 test set.

6. Conclusion 

In this paper, we have proposed several novel DCNN-based methods for object
saliency detection and image segmentation. The methods may utilize both
original training images and masked images to train several DCNNs. For each
test image, we first recognize the class label of the image, and then we can
use any of these DCNNs to generate a saliency map. Specifically, we attempt
to reduce a cost function defined to measure the class-specific objectness
of each image, and we back-propagate the corresponding error signals all the
way to the input layer and use the gradients with respect to the inputs to
revise the input images. After several iterations, the difference between
the original input images and the revised images is calculated as a saliency
map. The saliency maps can be used to initialize an image segmentation
algorithm to derive the final segmentation results. We have evaluated our
methods on two benchmark tasks, namely MS COCO [16] and Pascal VOC 2012 [8].
Experimental results have shown that our proposed methods can generate
high-quality saliency maps, clearly outperforming many existing methods. In
particular, our DCNN-based approaches excel on many difficult images
containing complex backgrounds, highly variable salient objects, multiple
objects, and very small objects.

Figure 7. Saliency Results of MS COCO (Column 1 to 5) and Pascal (Column 6 to 10). (A) original images, (B) masked images, (C) Region
Contrast saliency maps [6], (D) DCNN based saliency maps by using [21], (E) to (H) raw saliency maps using CNN1, CNN2, CNN3 and
CNN2 + CNN3, (I) smoothed saliency maps of (H) using image dilation and erosion, (J) refined saliency maps of (I), (K) segmentation
using SaliencyCut [6] and (L) our segmentation results based on (J).